A microprocessor partially decodes instructions retrieved
from main memory before placing them into the
microprocessor's integrated instruction cache. Each storage
location in the instruction cache includes two slots for
decoded instructions. One slot controls one of the
microprocessor's integer pipelines and a port to the
microprocessor's data cache. A second slot controls the
second integer pipeline or one of the microprocessor's
floating point units. The instructions retrieved from main
memory are decoded by a loader unit which decodes the
instructions from the compact form as stored in main memory
and places them into the two slots of the instruction cache
entry according to their functions. In addition, auxiliary
information is placed in the cache entry along with the
instruction to control parallel execution as well as
emulation of complex instructions. A bit in each instruction
cache entry indicates whether the instructions in the two
slots are independent, so that they can be executed in
parallel, or dependent, so that they must be executed
sequentially. Using a single bit for this purpose allows two
dependent instructions to be stored in the slots of a single
cache entry.

The present invention relates to microprocessor
architectures and, in particular, to a microprocessor that
partially decodes instructions retrieved from external
memory before storing them in an internal instruction cache.
Partially decoded instructions are retrieved from the
internal cache for either parallel or sequential execution
by multiple, parallel pipelined functional units.

In recent years, there has been a trend in the design of
microprocessor architectures from Complex Instruction Set
Computers (CISC) toward Reduced Instruction Set Computers
(RISC) to achieve high performance while maintaining
simplicity of design.

In a CISC architecture, each macroinstruction received by
the processor must be decoded internally into a series of
microinstruction subroutines. These microinstruction
subroutines are then executed by the microprocessor.

In a RISC architecture, the number of macroinstructions
which the processor can understand and execute is greatly
reduced. Further, those macroinstructions which the
processor can understand and execute are very basic, so that
the processor either does not decode them into any
microinstructions (the macroinstruction is executed in its
macro form) or the decoded microinstruction subroutine
involves very few microinstructions.

The transition from CISC architectures to RISC 
architectures has been driven by two fundamental 
developments in computer design that are now 
being extensively applied to microprocessors. 
These developments are integrated cache memory 
and optimizing compilers. 

A cache memory is a small high speed buffer located between
the processor and main memory to hold the instructions and
data most recently used by the processor. Experience shows
that computers very commonly exhibit strong characteristics
of locality in their memory references. That is, references
tend to occur frequently either to locations that have
recently been referred to (temporal locality) or to
locations that are near others that have recently been
referred to (spatial locality). As a consequence of this
locality, a cache memory that is much smaller than main
memory can capture the large majority of a program's memory
references. Because the cache memory is relatively small, it
can be realized from a faster memory technology than would
be economical for the much larger main memory.

Before the development of cache memory techniques for use in
mainframe computers, there was a large imbalance between the
cycle time of a processor and that of memory. This imbalance
was a result of the processor being realized from relatively
high speed bipolar semiconductor technology and the memory
being realized from much slower magnetic-core technology.
The inherent speed difference between logic and memory
spurred the development of complex instruction sets that
would permit the fetching of a single instruction from
memory to control the operation of the processor for several
clock cycles. The imbalance between processor and memory
speeds was also characteristic of the early generations of
32-bit microprocessors. Those microprocessors would commonly
take 4 or 5 clock cycles for each memory access.

Without the introduction of integrated cache memory, it is
unlikely that RISC architectures would have become
competitive with CISC architectures. Because a RISC
processor executes more instructions than does a CISC
processor to accomplish the same task, a RISC processor can
deliver performance equivalent to that of a CISC only if a
faster and more expensive memory system is employed.
Integrated cache memory enables a RISC processor to fetch an
instruction in the same time required to execute the
instruction by an efficient processor pipeline.

The second development that has led to the effectiveness of
RISC architectures is optimizing compilers. A compiler,
which may be implemented in either hardware or software,
translates a computer program from the high-level language
used by the programmer into the machine language understood
by the computer.

For many years after the introduction of high-level
languages, computers were still extensively programmed in
assembly language. Assembly language is a low-level source
code language employing crude mnemonics that are more easily
remembered by the programmer than object-code or binary
equivalents. The advantages of improved software
productivity and translatability of high-level language
programming were clear, but simple compilers produced
inefficient code. Early generations of 32-bit
microprocessors were developed with consideration for
assembly language programming and simple compilers.

More recently, advances in compiler technology are being
applied to microprocessors. Optimizing compilers can analyze
a program to allocate large numbers of registers efficiently
and to manage processor pipeline resources. As a
consequence, high-level language programs can execute with
performance comparable to or exceeding that of assembly
programs.

Many of the leading pioneers in RISC developments have been
compiler specialists who have demonstrated that optimizing
compilers can produce highly efficient code for simple,
regular architectures.

Highly integrated single-chip microprocessors employ both
pipelined and parallel execution to improve performance.
Pipelined execution means that while the microprocessor is
fetching one instruction, it can be simultaneously decoding
a second instruction, reading source operands for a third
instruction, calculating results for a fourth instruction
and writing results for a fifth instruction. Parallel
execution means that the microprocessor can initiate the
operations for two or more independent instructions
simultaneously in separate functional units.

As stated above, one of the main challenges in designing a
high-performance microprocessor with multiple, pipelined
functional units is to provide sufficient instruction memory
on-chip and to access the instruction memory efficiently to
control the functional units.

The requirement for efficient control of a microprocessor's
functional units dictates a regular instruction format that
is simple to decode. However, in conventional microprocessor
architectures, instructions in main memory are highly
encoded and of variable length to make efficient use of
space in main memory and the limited bandwidth available
between the microprocessor and the main memory.

The present invention is defined by the independent claims
and provides a microprocessor that resolves the conflicting
requirements for efficient use of main memory storage space
and efficient control of the functional units by partially
decoding instructions retrieved from main memory before
placing them into the microprocessor's integrated
instruction cache. Preferably, each entry in the instruction
cache has two slots for partially decoded instructions. One
slot controls one of the microprocessor's execution
pipelines and a port to its data cache. The second slot
controls a second execution pipeline, or one of the
microprocessor's floating point units, or a control transfer
instruction. An instruction decoding unit, or loader,
decodes instructions from their compact format as stored in
main memory and places them into the two slots of the
instruction cache entry according to their functions.
Auxiliary information may also be placed in the cache entry
along with the instruction to control parallel execution and
emulation of complex instructions. A bit in each cache entry
may indicate whether the instructions in the two slots for
that entry are independent, so that they can be executed in
parallel, or dependent, so that they must be executed
sequentially. Using a single bit for this purpose allows two
dependent instructions to be stored in the slots of a single
cache entry. Otherwise, the two instructions would have to
be stored in separate entries and only one-half of the cache
memory would be utilized in those two entries.

A better understanding of the features and advantages of the
present invention will be obtained by reference to the
following detailed description of the invention and
accompanying drawings which set forth an illustrative
embodiment in which the principles of the invention are
utilized.

Figure 1 is a block diagram illustrating a microprocessor
architecture that incorporates the concepts of the present
invention.

Figure 2 is a block diagram illustrating the structure of a
partially decoded instruction cache utilized in the Fig. 1
architecture.

Figure 3 is a simplified representation of a partially
decoded entry stored in the instruction cache shown in
Fig. 2.

Figure 4 is a block diagram illustrating the structure of
the integer pipelines utilized in the microprocessor
architecture shown in Fig. 1.

Fig. 1 shows a block diagram of a microprocessor 10 that
includes multiple, pipelined functional units that are
capable of executing two instructions in parallel.

The microprocessor 10 includes three main sections: an
instruction processor 12, an execution processor 14 and a
bus interface processor 16.

The instruction processor 12 includes three modules: an
instruction loader 18, an instruction emulator 20 and an
instruction cache 22. These modules load instructions from
the external system through the bus interface processor 16,
store the instructions in the instruction cache 22 and
provide pairs of instructions to the execution processor 14
for execution.

The execution processor 14 includes two 4-stage pipelined
integer execution units 24 and 26, a double-precision
5-stage pipelined floating point execution unit 28, and a
1024 byte data cache 30. A set of integer registers 32
services the two integer units 24 and 26; similarly, a set
of floating point registers 34 services the floating point
execution unit 28.

The bus interface processor 16 includes a bus interface unit
36 and a number of system modules 38. The bus interface unit
36 controls the bus accesses requested by both the
instruction processor 12 and the execution processor 14. In
the illustrated embodiment, the system modules 38 include a
timer 40, a direct memory access (DMA) controller 42, an
interrupt control unit (ICU) 44 and I/O buffers 46.

As described in greater detail below, the instruction loader
18 partially decodes instructions retrieved from main memory
and places the partially decoded instructions in the
instruction cache 22. That is, the instruction loader 18
translates an instruction stored in main memory (not shown)
into the decoded format of the instruction cache 22. As will
also be described in greater detail below, the instruction
loader 18 is also responsible for checking whether any
dependencies exist between consecutive instructions that are
paired in a single instruction cache entry.

The instruction cache 22 contains 512 entries 
for partially-decoded instructions. 

In accordance with one aspect of the present invention, and
as explained in greater detail below, each entry in the
instruction cache 22 contains either one or two instructions
stored in a partially-decoded format for efficient control
of the various functional units of the microprocessor 10.

In accordance with another aspect of the present invention,
each entry in instruction cache 22 also contains auxiliary
information that indicates whether the two instructions
stored in that entry are independent, so that they can be
executed in parallel, or dependent, so that they must be
executed sequentially.

The instruction emulator 20 executes special instructions
defined in the instruction set of the microprocessor 10.
When the instruction loader 18 encounters such an
instruction, it transfers control to the emulator 20. The
emulator is responsible for generating a sequence of core
instructions (defined below) that perform the function of a
single complex instruction (defined below). In this regard,
the emulator 20 provides ROM-resident microcode. The
emulator 20 also controls exception processing and self-test
operations.

The two 4-stage integer pipelines 24 and 26 perform basic
arithmetic/logical operations and memory references. Each
integer pipeline 24, 26 can execute instructions at a
throughput of one per system clock cycle.

The floating point execution unit 28 includes 
three sub-units that perform single-precision and 
double-precision operations. An FPU adder sub- 
unit 28a is responsible for add and convert oper- 
ations, a second sub-unit 28b is responsible for 
multiply operations and a third sub-unit 28c is 
responsible for divide operations. 

When add and multiply operations are alternately executed,
the floating point execution unit 28 can execute
instructions at a throughput of one instruction per system
clock cycle.

Memory references for the floating point execution unit 28
are controlled by one of the integer pipelines 24, 26 and
can be performed in parallel to floating-point operations.

Data memory references are performed using the 1-Kbyte data
cache 30. The data cache 30 provides fast on-chip access to
frequently used data. In the event that data are not located
in the data cache 30, then off-chip references are performed
by the bus interface unit (BIU) 36 using the pipelined
system bus 48.

The data cache 30 employs a load scheduling technique so
that it does not necessarily stall on misses. This means
that the two execution pipelines 24, 26 can continue
processing instructions and initiating additional memory
references while data is being read from main memory.

The bus interface unit 36 can receive requests for main
memory accesses from either the instruction processor 12 or
execution processor 14. These requests are sent to the
external pipelined bus 48. The external bus can be
programmed to operate at half the frequency of the
microprocessor 10; this allows for a simple instruction
interface at a relatively low frequency while the
microprocessor 10 executes a pair of instructions at full
rate.

The instruction set of the microprocessor 10 is partitioned
into a core part and a non-core part. The core part of the
instruction set consists of performance critical
instructions and addressing modes, together with some
special-function instructions for essential system
operations. The non-core part consists of the remainder of
the instruction set. Performance critical instructions and
addressing modes were selected based on an analysis and
evaluation of the operating system (UNIX in this case)
workload and various engineering, scientific and embedded
controller applications. These instructions are executed
directly as part of the RISC architecture of microprocessor
10.

As stated above, special-function and non-core instructions
are emulated in microprocessor 10 by macroinstruction
subroutines using sequences of core instructions. That is,
instructions that are a part of the overall instruction set
of the microprocessor 10 architecture, but that lie outside
the directly-implemented RISC core, are executed under
control of the instruction emulator 20. When the instruction
loader 18 encounters a non-core instruction, it either
translates it into a pair of core instructions (for simple
instructions like MOVB 1(R0),0(R1)) or transfers control to
the instruction emulator 20. The instruction emulator 20 is
responsible for generating a sequence of core instructions
that perform the function of the single, complex
instruction.

Fig. 2 shows the structure of the instruction cache 22. The
instruction cache 22 utilizes a 2-way, set-associative
organization with 512 entries for partially decoded
instructions. This means that for each memory address there
are two entries in the instruction cache 22 where the
instruction located at that address can be placed. The two
entries are called a "set".

As shown in Fig. 3, each instruction cache entry includes
two slots, i.e. Slot A and Slot B. Thus, each entry can
contain one or two partially-decoded instructions that are
represented with fixed fields for opcode (Opc), source and
destination register numbers (R1 and R2, respectively), and
immediate values (32b IMM). The entry also includes
auxiliary information used to control the sequence of
instruction execution, including a bit P that indicates
whether the entry contains two consecutive instructions that
can be executed in parallel and a bit G that indicates
whether the entry is for a complex instruction that is
emulated, and additional information representing the length
of the instruction(s) in a form that allows fast calculation
of the next instruction's address.
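
The entry layout described above can be pictured with a small data structure. The following is only an illustrative sketch: the class names, the field names and the choice to store the length as a byte count are assumptions made for this example, since the text says only that the length is kept in a form that allows fast calculation of the next instruction's address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    opc: str                   # decoded opcode field (Opc)
    r1: Optional[int] = None   # source register number (R1)
    r2: Optional[int] = None   # destination register number (R2)
    imm: int = 0               # 32-bit immediate field (32b IMM)


@dataclass
class CacheEntry:
    slot_a: Optional[Slot]     # may be empty when only one instruction is stored
    slot_b: Optional[Slot]
    parallel: bool             # bit P: the two instructions may run in parallel
    emulated: bool             # emulation bit: entry belongs to an emulated instruction
    length: int                # assumed encoding: total byte length of the instruction(s)


def next_address(pc: int, entry: CacheEntry) -> int:
    """Address of the instruction following this entry (32-bit wraparound)."""
    return (pc + entry.length) & 0xFFFFFFFF
```

Keeping the length alongside the decoded slots means the next fetch address is a single addition and does not require re-examining the variable-length encoding in main memory.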

Referring back to Fig. 2, associated with each entry in the
instruction cache 22 is a 26-bit tag, TAG0 and TAG1,
respectively, that holds the 22 most-significant bits, 3
least-significant bits and a User/Supervisor bit of the
virtual address of the instruction stored in the entry. In
the event that two consecutive instructions are paired in an
entry, the tag corresponds to the instruction at the lower
address. Associated with the tag are 2 bits that indicate
whether the entry is valid and whether it is locked. For
each set there is an additional single bit that indicates
the entry within the set that is next to be replaced in a
Least-Recently-Used (LRU) order.

The instruction cache 22 is enabled for an instruction fetch
if a corresponding bit of the configuration register of
microprocessor 10, which is used to enable or disable
various operating modes of the microprocessor 10, is 1 and
either address translation is disabled or the CI-bit is 0 in
the level-2 Page Table Entry (PTE) used to translate the
virtual address of the instruction.
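
The enabling condition above is a simple boolean expression. The sketch below restates it; the function and parameter names are hypothetical and chosen only for this illustration.

```python
def icache_enabled(cfg_enable_bit: int,
                   translation_enabled: bool,
                   pte_inhibit_bit: int) -> bool:
    """Hypothetical helper restating the condition in the text.

    cfg_enable_bit      -- the configuration register bit that enables the cache
    translation_enabled -- whether address translation is active
    pte_inhibit_bit     -- the level-2 PTE bit that must be 0 for caching
    """
    return cfg_enable_bit == 1 and (not translation_enabled or pte_inhibit_bit == 0)
```

With translation disabled, the page table bit plays no part; with translation enabled, a page can opt out of instruction caching individually.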

If the instruction cache 22 is disabled, then the
instruction fetch bypasses the instruction cache 22 and the
contents of the instruction cache 22 are unaffected. The
instruction is read directly from main memory, partially
decoded by the instruction loader 18 to form an entry (which
may contain two partially decoded instructions), and
transferred to the integer pipelines 24, 26 via the IL
BYPASS line for execution.

As shown in Fig. 2, if the instruction cache 22 is enabled
for an instruction fetch, then eight bits, i.e. bits
PC(10:3), of the instruction's address provided by the
program counter (PC) are decoded to select the set of
entries where the instruction may be stored. The selected
set of four entries is read and the associated tags are
compared with the 22 most-significant bits, i.e. PC(31:10),
and 2 least-significant bits PC(1:0) of the instruction's
virtual address. If one of the tags matches and the matching
entry is valid, then the entry is selected for transfer to
the integer pipelines 24, 26 for execution. Otherwise, the
missing instruction is read directly from main memory and
partially decoded, as explained below.
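
The lookup can be sketched as follows. This is an illustrative model only: it assumes 256 sets of two ways (eight index bits and 512 entries), and it models the tag as the 22 most-significant address bits, the 3 least-significant bits and the User/Supervisor bit, following the 26-bit tag description given with Fig. 2. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

NUM_SETS = 256   # eight index bits, PC(10:3)
WAYS = 2         # 2-way set-associative: 256 sets x 2 ways = 512 entries


@dataclass
class Way:
    tag: tuple = ()
    valid: bool = False
    entry: Optional[str] = None   # stand-in for a partially decoded entry


def set_index(pc: int) -> int:
    """Select the set from address bits PC(10:3)."""
    return (pc >> 3) & 0xFF


def make_tag(pc: int, user: bool) -> tuple:
    """Assumed tag: PC(31:10), the 3 low address bits, and the U/S bit."""
    return (pc >> 10, pc & 0x7, user)


def new_cache() -> List[List[Way]]:
    return [[Way() for _ in range(WAYS)] for _ in range(NUM_SETS)]


def lookup(cache: List[List[Way]], pc: int, user: bool) -> Optional[str]:
    """Return the matching valid entry, or None on a miss."""
    wanted = make_tag(pc, user)
    for way in cache[set_index(pc)]:
        if way.valid and way.tag == wanted:
            return way.entry
    return None
```

A fetch that returns None corresponds to the miss case, in which the instruction is read from main memory and partially decoded by the loader.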

If the referenced instruction is missing from the
instruction cache 22 and the contents of the selected set
are all locked, then the handling of the reference is
identical to that described above for the case when the
instruction cache 22 is disabled.



If the referenced instruction is missing from the
instruction cache 22 and at least one of the entries in the
selected set is not locked, then the following actions are
taken. One of the entries is selected for replacement
according to the least recently used (LRU) replacement
algorithm and then the LRU pointer is updated. If the entry
selected for replacement is locked, then the handling of the
reference is identical to that described above for the case
when the instruction cache 22 is disabled. Otherwise, the
missing instruction is read directly from external memory
and then partially decoded by instruction loader 18 to form
an entry (that may contain two partially decoded
instructions) which is transferred to the integer pipelines
24, 26 for execution. If CIIN is not active during the bus
cycles to read the missing instruction, then the partially
decoded instruction is also written into the instruction
cache entry selected for replacement, the associated valid
bit is set and the entry is locked if Lock-Instruction-Cache
bit CFG.LIC in the configuration register is 1.
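
The miss handling just described can be modeled for a single two-entry set as below. This is a sketch under stated assumptions: the text does not say how the LRU pointer is updated, so the example simply points it at the other entry; the names `Way` and `handle_miss` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Way:
    tag: Optional[int] = None
    valid: bool = False
    locked: bool = False
    entry: Optional[str] = None   # stand-in for a partially decoded entry


def handle_miss(ways: List[Way], lru: int, tag: int, decoded: str,
                ciin_active: bool, cfg_lic: int) -> Tuple[int, bool]:
    """Handle a miss in one 2-entry set; return (new_lru, was_cached).

    In every case the decoded entry still goes to the pipelines; this
    function only decides whether it is also written into the cache.
    """
    if all(w.locked for w in ways):
        return lru, False                 # whole set locked: bypass the cache
    victim = lru
    new_lru = 1 - lru                     # assumption: pointer moves to the other entry
    if ways[victim].locked:
        return new_lru, False             # selected entry is locked: bypass
    if ciin_active:
        return new_lru, False             # CIIN active: do not cache this fetch
    ways[victim] = Way(tag, True, cfg_lic == 1, decoded)
    return new_lru, True
```

Note that a locked victim causes a bypass even when the other entry in the set is free, which follows the order of the checks given in the text.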

After the microprocessor 10 has completed fetching a missing
instruction from external main memory, it will continue
prefetching sequential instructions. For subsequent
sequential instruction fetches, the microprocessor 10
searches the instruction cache 22 to determine whether the
instruction is located on-chip. If the search is successful
or a non-sequential instruction fetch occurs, then the
microprocessor 10 ceases prefetching. Otherwise, the
prefetched instructions are rapidly available for decoding
and executing. The microprocessor 10 initiates prefetches
only during bus cycles that would otherwise be idle because
no off-chip data references are required.

It is possible to fetch an instruction and lock it into the
instruction cache 22 without having to execute the
instruction. This can be accomplished by enabling a Debug
Trap (DBG) for a Program Counter value that matches the
instruction's address. Debug Trap is a service routine that
performs actions appropriate to this type of exception. At
the conclusion of the DBG routine, the REturn to Execution
(RETX) instruction is executed to resume executing
instructions at the point where the exception was
recognized. The instruction will be fetched and placed into
the instruction cache 22 before the trap is processed.

When the instruction which is locked in the instruction
cache 22 gets to execution and a Debug Trap on that
instruction is enabled, instead of executing the
instruction, the processor will jump to the Debug Trap
service routine. The service routine may set a breakpoint
for the next instruction so that when the processor returns
from the service routine, it will not execute the next
instruction but rather will go again to the Debug Trap
routine.

The process described above, which usually gets executed
during system bootstrap, allows the user to store routines
in the instruction cache 22, lock them and have them ready
for operation without executing them during the locking
process.

Further information relating to the architecture of
microprocessor 10 and its cache locking capabilities is
provided in commonly-assigned application Serial No. ____,
filed on the same date as this application and titled
SELECTIVELY LOCKING MEMORY LOCATIONS WITHIN A
MICROPROCESSOR'S ON-CHIP CACHE; the just-referenced
application Serial No. ____ is hereby incorporated by
reference to provide further background information
regarding the present invention.

The contents of the instruction cache 22 can be invalidated
by software or by hardware.

The instruction cache 22 is invalidated by software as
follows: The entire instruction cache contents, including
locked entries, are invalidated while bit CFG.IC of the
Configuration Register is 0. The LRU replacement information
is also initialized to 0 while bit CFG.IC is 0. The Cache
Invalidate (CINV) instruction can be executed to invalidate
the entire instruction cache contents. Executing CINV
invalidates either the entire cache or only unlocked lines
according to the instruction's l-option.

The entire instruction cache 22 is invalidated in hardware
by activating an INVIC input signal.

Fig. 3 shows a simplified view of a partially decoded entry
stored in the instruction cache 22. As shown in Fig. 3, each
entry has two slots for instructions. Slot A controls
integer pipeline 24 and the port to data cache 30. Slot B
controls the second integer pipe 26, or one of the floating
point units, or a control transfer instruction. Slot B can
also control the port to data cache 30, but only if Slot A
is not using the data cache 30. As stated above, instruction
loader 18 retrieves encoded instructions from their compact
format in main memory and places them into Slots A and B
according to their functions.

Thus, in accordance with the present invention, the novel
aspects of instruction cache 22 include (1) partially
decoding instructions for storage in cache memory, (2)
placing of instructions into two cache slots according to
their function and (3) placing auxiliary information in the
cache entries along with the instructions to control
parallel execution and emulation of complex instructions.

As further shown in Fig. 3, a bit P in each instruction
cache entry indicates whether the instructions in Slots A
and B are independent, so they can be executed in parallel,
or dependent, so they must be executed sequentially.

An example of independent instructions that can be executed
in parallel is:
Load 4(R0),R1 ; Addd 4,R0

An example of dependent instructions requiring sequential
execution is:
Addd R0,R1 ; Addd R1,R2

Using a single bit for this purpose allows two dependent
instructions to be stored in the slots of a single cache
entry. Otherwise, the two instructions would have to be
stored in separate entries and only 1/2 of the instruction
cache 22 would be utilized in those two entries.
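
How the loader might derive the P bit for the two examples above can be sketched as follows. The rule used here is an assumption made for illustration: a pair is treated as dependent when the second instruction reads or writes a register that the first instruction writes. The text does not spell out the exact rule, and the register sets below are hand-written readings of the two example pairs (with the operand order taken as source, destination).

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Instr:
    text: str
    reads: Set[str] = field(default_factory=set)
    writes: Set[str] = field(default_factory=set)


def p_bit(first: Instr, second: Instr) -> bool:
    """True when the pair may execute in parallel (assumed rule)."""
    read_after_write = first.writes & second.reads
    write_after_write = first.writes & second.writes
    return not read_after_write and not write_after_write


# Independent pair from the text: Load 4(R0),R1 ; Addd 4,R0
load = Instr("Load 4(R0),R1", reads={"R0"}, writes={"R1"})
add_imm = Instr("Addd 4,R0", reads={"R0"}, writes={"R0"})

# Dependent pair from the text: Addd R0,R1 ; Addd R1,R2
add_1 = Instr("Addd R0,R1", reads={"R0", "R1"}, writes={"R1"})
add_2 = Instr("Addd R1,R2", reads={"R1", "R2"}, writes={"R2"})
```

Under this rule the first pair yields a P bit of 1 and the second a P bit of 0; in the second case both instructions still share one cache entry and are simply issued one after the other.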

Fig. 3 also shows a bit Q in each instruction cache entry
that indicates whether the instructions in Slots A and B are
emulating a single, more complex instruction from main
memory. For example, the loader translates the single
instruction ADDD 0(R0),R1 into the following pair of
instructions in Slots A and B and sets the sequential and
emulation flags in the entry:
Load 0(R0),Temp
ADDD Temp,R1
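
The translation in this example can be written out as a small function. This is a hedged sketch of the one case shown, not of the loader's general behavior; the function name, the tuple layout and the dictionary keys are hypothetical.

```python
def expand_add_from_memory(disp: int, base: str, dest: str) -> dict:
    """Split a memory-operand add into a Load plus a register add.

    Mirrors the single example in the text; 'Temp' stands for the
    temporary register named there.
    """
    return {
        "slot_a": ("Load", f"{disp}({base})", "Temp"),
        "slot_b": ("ADDD", "Temp", dest),
        "parallel": False,   # sequential flag: Slot B consumes Slot A's result
        "emulated": True,    # emulation flag: the pair stands for one instruction
    }
```

For example, `expand_add_from_memory(0, "R0", "R1")` produces the Load/ADDD pair listed above with the sequential and emulation flags set.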

In accordance with the pipelined organization of the
microprocessor 10, every instruction executed by the
microprocessor 10 goes through a series of stages. The two
integer pipelines 24, 26 (Fig. 1) are able to work in
parallel on instruction pairs. Integer unit 24 and integer
unit 26 are not identical, the instructions that can be
executed in integer unit 24 being a sub-set of those that
can be executed in integer unit 26.

As stated above, instruction fetching is performed by the
instruction loader 18 which stores decoded instructions in
the instruction cache 22. The integer dual-pipe receives
decoded instruction pairs for execution.

Referring again to Fig. 3, as stated above, an instruction
pair consists of two slots: Slot A and Slot B. The
instruction in Slot A is scheduled for integer unit 24; the
instruction in Slot B is scheduled for integer unit 26. Two
instructions belonging to the same pair advance at the same
time from one stage of the integer pipeline to the next,
except in the case when the instruction in Slot B is delayed
in the instruction decode stage of the pipeline as described
below. In this case, the instruction in integer pipeline 24
can advance to the following pipeline stages. However, new
instructions cannot enter the pipeline until the instruction
decode stage is free in both pipeline unit 24 and pipeline
unit 26.

Although the unit 24 and unit 26 instructions are executed
in parallel (except in the case of the stalled ID-B
instruction), the Slot A instruction always logically
precedes the corresponding Slot B instruction and, if the
Slot A instruction cannot be completed due to an exception,
then the corresponding Slot B instruction is discarded.

Referring to Fig. 4, each of the integer pipeline units 24,
26 includes four stages: an instruction decode stage (ID),
an execute stage (EX), a memory access stage (ME) and a
store result stage (ST).

An instruction is fed into the ID stage of the integer unit
for which it is scheduled, where its decoding is completed
and register source operands are read. In the EX stage, the
arithmetic/logical unit of the microprocessor 10 is
activated to compute the instruction's results or to compute
the effective memory address for Load/Store instructions. In
the ME stage, the data cache 30 (Fig. 1) is accessed by
Load/Store instructions and exception conditions are
checked. In the ST stage, results are written to the
register file, or to the data cache 30 in the case of a
Store instruction, and Program Status Register (PSR) flags
are updated. At this stage, the instruction can no longer be
undone.

As further shown in Fig. 4, results from the EX stage and
the ME stage can be fed back to the ID stage, thus enabling
instruction latency of 1 or 2 cycles.

In the absence of any delays, the dual execution pipeline of
microprocessor 10 accepts a new instruction pair every clock
cycle (i.e., peak throughput of two instructions per cycle)
and scrolls all other instructions down one stage along the
pipeline. The dual pipeline includes a global stalling
mechanism by which any functional unit can stall the
pipeline if it detects a hazard. Each functional unit stalls
the corresponding stage and all stages preceding it for one
more cycle. When a stage stalls, it keeps the instruction
currently residing in it for another cycle and then restarts
all stage activities exactly as in the non-stalled case.
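
The advance-or-stall behavior can be illustrated with a one-cycle step function over the four stages. This is a simplified model, not the hardware mechanism: it treats each stage as holding one instruction pair, and it assumes that the stage immediately after a stalled stage receives an empty slot (a bubble) for that cycle. The names are hypothetical.

```python
from typing import List, Optional, Tuple

STAGES = ["ID", "EX", "ME", "ST"]


def step(pipe: List[Optional[str]], incoming: Optional[str],
         stall_at: Optional[str] = None
         ) -> Tuple[List[Optional[str]], Optional[str], bool]:
    """Advance the pipeline by one clock cycle.

    pipe[0] is the ID stage and pipe[3] the ST stage.
    Returns (new_pipe, retired, accepted), where accepted says whether
    the incoming pair entered the ID stage this cycle.
    """
    if stall_at is None:
        # No hazard: everything scrolls down one stage.
        return [incoming] + pipe[:3], pipe[3], True
    k = STAGES.index(stall_at)
    retired = None if k == 3 else pipe[3]
    # Stages 0..k hold their contents; the stage after k gets a bubble
    # (assumption); later stages advance normally.
    new_pipe = (pipe[:k + 1] + [None] + pipe[k + 1:3])[:4]
    return new_pipe, retired, False
```

For instance, with pairs i3, i2, i1, i0 in ID, EX, ME, ST, a stall raised at EX keeps i3 and i2 in place, lets i1 move on to ST, retires i0 and refuses the incoming pair for that cycle.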

The pipeline unit on which each instruction is to be
executed is determined at run time by the instruction loader
18 when instructions are fetched from main memory.

The instruction loader 18 decodes prefetched instructions,
tries to pack them into instruction pair entries and
presents them to the dual-pipeline. If the instruction cache
22 is enabled (as described above), cacheable instructions
can be stored in the instruction cache 22. In this case, an
entry containing an instruction pair or a single instruction
is also sent to the instruction cache 22 and stored there as
a single cache entry. On instruction cache hits, stored
instruction pairs are retrieved from the instruction cache
22 and presented to the dual-pipeline for execution.

The instruction loader 18 attempts to pack instructions into
pairs whenever possible. The packing of two instructions
into one entry is possible only if the first instruction can
be executed by integer pipeline unit 24 and both
instructions are less than a preselected maximum length. If
it is impossible to pack two instructions into a pair, then
a single instruction is placed in Slot B.



Two instructions can be paired only when all of
the following conditions hold: (1) both instructions
are performance-critical core instructions, (2) the
first instruction is executable by integer pipeline
unit 24, and (3) the displacement and immediate
fields in both instructions use short encoding (short
encoding for all instructions except the Branch
instruction is 11 bits and 17 bits for the Conditional
Branch and Branch and Link instructions).

Several instructions of the microprocessor 10
instruction set are restricted to run on integer
pipeline unit 28 only. For example, because instruction
pairs in the instruction cache 22 are tagged by the
Slot A address, it is not useful to put a Branch
instruction in Slot A since the corresponding Slot B
instruction will not be accessible. Similarly, since
there is a single arithmetic floating point pipe, it is
not possible to execute two arithmetic floating point
instructions in parallel. Restricting these instructions
to integer pipeline unit 28 makes it possible
to considerably simplify the dual-pipe data path
design without hurting performance.
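
The three pairing conditions listed above can be expressed as a simple predicate. The sketch below is illustrative: the `Instr` record, its field names and the function names are hypothetical, and only the 11-bit and 17-bit limits and the unit 24 requirement come from the text.

```python
from dataclasses import dataclass

SHORT_BITS = 11         # short encoding for non-branch instructions (from the text)
SHORT_BRANCH_BITS = 17  # short encoding for Conditional Branch / Branch and Link

@dataclass
class Instr:
    # Hypothetical record describing one decoded instruction.
    name: str
    performance_critical: bool  # condition (1): a core instruction
    unit24_ok: bool             # executable by integer pipeline unit 24
    field_bits: int = 0         # bits needed by the immediate/displacement field
    is_branch: bool = False

def uses_short_encoding(ins):
    """Condition (3): the field fits the short encoding for its kind."""
    limit = SHORT_BRANCH_BITS if ins.is_branch else SHORT_BITS
    return ins.field_bits <= limit

def can_pair(first, second):
    """True when all three pairing conditions hold for (first, second)."""
    return (first.performance_critical and second.performance_critical
            and first.unit24_ok
            and uses_short_encoding(first)
            and uses_short_encoding(second))
```

Under these assumptions, an ADD followed by a branch with a 17-bit displacement can be paired, whereas the same two instructions in the opposite order cannot, because the branch is not executable by unit 24.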

Integer unit 28 can execute any instructions in
the microprocessor 10 instruction set.

The instruction loader 18 initiates instruction
pairing upon an instruction cache miss, in which
case it begins prefetching instructions into an
instruction queue. In parallel, the instruction loader
18 examines the next instruction not yet removed
from the instruction queue and attempts to pack it
according to the following algorithm:

Step 1: Try to fit the next instruction into Slot A.

(a) If the next instruction is not performance-critical,
then go to Step 5.

(b) Remove the next instruction from the instruction
queue and tentatively place it in Slot A.

(c) If the instruction is illegal for Slot A, or if the
instruction has an immediate/displacement field
that cannot be represented in 11 bits, or if the
instruction is not quad-word aligned, then go to
Step 4.

(d) Otherwise, continue to Step 2.

Step 2: Try to fit the next instruction into Slot B.

(a) If the next instruction is not performance-critical,
or the next instruction has an encoded
immediate/displacement field longer than 11
bits, or the next instruction is a branch with
displacement longer than 17 bits, then go to
Step 4.

(b) Otherwise, remove the next instruction from
the instruction queue, place it in Slot B and go
to Step 3.

Step 3: Construct an instruction pair entry.

In this case, both Slot A and Slot B contain
valid instructions and all pairing conditions are
satisfied. Issue a pair entry and go to Step 1.

Step 4: Construct a single instruction entry.

In this case, Slot A contains an instruction
which cannot be paired. Move this instruction to
Slot B. If this instruction contains an
immediate/displacement field longer than 17 bits,
or it is a branch with displacement longer than 17
bits, and is not quad-word aligned, then replace it
with UNDefined. Issue the entry and go to Step 1.

Step 5: Handle non-performance-critical
instructions.

Remove the next instruction from the instruction
queue and send it to the instruction emulator
20. When finished with this instruction, go to Step
1.
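
Steps 1 to 5 can be modelled as a short loop over the instruction queue. The sketch below is illustrative only: the `Instr` record, the function names and the tuple format of the issued entries are hypothetical. Three points are assumptions because the text leaves them open: the Step 4 condition is read as "field longer than 17 bits and not quad-word aligned", branches are given the 17-bit limit in Step 2, and an empty queue in Step 2 is treated like a failed fit.

```python
from collections import deque
from dataclasses import dataclass

SHORT_BITS = 11   # short immediate/displacement width given in the text
BRANCH_BITS = 17  # short branch displacement width given in the text

@dataclass
class Instr:
    # Hypothetical record describing one prefetched instruction.
    name: str
    critical: bool = True      # performance-critical core instruction
    slot_a_ok: bool = True     # legal for Slot A (a branch, for example, is not)
    field_bits: int = 0        # bits needed by the immediate/displacement field
    is_branch: bool = False
    quad_aligned: bool = True

def fits_slot_b(ins):
    """Step 2(a) test, read so that branches get the 17-bit limit."""
    limit = BRANCH_BITS if ins.is_branch else SHORT_BITS
    return ins.critical and ins.field_bits <= limit

def single_entry(ins):
    """Step 4: move the instruction to Slot B, or substitute UNDefined.

    Assumption: the condition is read as (field longer than 17 bits)
    AND (not quad-word aligned); the source wording is ambiguous.
    """
    if ins.field_bits > BRANCH_BITS and not ins.quad_aligned:
        return ("single", "UNDefined")
    return ("single", ins.name)

def pack(instructions):
    """Return the entries the loader would issue for the given queue."""
    queue = deque(instructions)
    entries = []
    while queue:
        # Step 1(a): non-critical instructions go to the emulator (Step 5).
        if not queue[0].critical:
            entries.append(("emulate", queue.popleft().name))
            continue
        a = queue.popleft()                       # Step 1(b)
        if (not a.slot_a_ok or a.field_bits > SHORT_BITS
                or not a.quad_aligned):           # Step 1(c)
            entries.append(single_entry(a))       # Step 4
            continue
        # Step 2: an empty queue is treated like a failed fit (assumption).
        if not queue or not fits_slot_b(queue[0]):
            entries.append(single_entry(a))       # Step 4
            continue
        b = queue.popleft()                       # Step 2(b)
        entries.append(("pair", a.name, b.name))  # Step 3
    return entries
```

In this model a non-critical instruction that blocks Step 2 is not consumed there; it is picked up by Step 1(a) on the next pass and handed to the emulator.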

The just-described pairing algorithm packs two
instructions whenever they can be held in a single
instruction cache entry. However, these instructions
may happen to be dependent, in which case they
cannot be executed in parallel. The dependencies
are detected by the execution processor 14.
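
How a dependency between the two slots changes issue can be illustrated with a small model. The text does not say how the execution processor 14 detects dependencies, so the register-overlap test below (read-after-write and write-after-write only) is an assumption chosen for illustration, as are the dictionary layout and the function names.

```python
def dependent(slot_a, slot_b):
    """Return True if the Slot B instruction must wait for Slot A.

    Each slot is a dict with a "name" string and "dest" and "srcs"
    sets of register names (a hypothetical representation).
    """
    read_after_write = slot_a["dest"] & slot_b["srcs"]
    write_after_write = slot_a["dest"] & slot_b["dest"]
    return bool(read_after_write or write_after_write)

def issue_cycles(slot_a, slot_b):
    """List the instructions issued in each cycle for one pair entry."""
    if dependent(slot_a, slot_b):
        # Dependent pair: executed sequentially, Slot A first.
        return [[slot_a["name"]], [slot_b["name"]]]
    # Independent pair: both slots issue in the same cycle.
    return [[slot_a["name"], slot_b["name"]]]
```

With this model, a pair in which Slot B reads the register written by Slot A takes two issue cycles, while an unrelated pair takes one.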

It should be understood that various alternatives
to the embodiment of the invention described
herein may be utilized in practicing the invention. It
is intended that the following claims define the
scope of the invention and that methods and
apparatus within the scope of these claims and their
equivalents be covered thereby.

Claims

1. A processor that executes instructions retrieved
from a main memory external to the
processor or from an internal instruction cache
memory, the processor comprising:

(a) means for retrieving an encoded instruction
from the main memory;

(b) means for decoding the encoded instruction
retrieved from main memory;

(c) internal cache memory storage means
for storing the decoded instruction; and

(d) means for retrieving the decoded instruction
from the internal cache memory
storage means for execution by the processor.

2. A microprocessor that executes instructions
retrieved from a main memory external to the
microprocessor or from an internal instruction
cache memory, the microprocessor comprising:

(a) a plurality of functional units for executing
instructions;

(b) means for retrieving encoded instructions
from main memory;

(c) means for decoding the encoded
instructions retrieved from main memory;

(d) internal cache memory storage means
comprising a plurality of storage locations,
each storage location comprising a plurality
of storage slots, each of the storage slots
comprising means for storing a decoded
instruction; and

(e) means for simultaneously retrieving a
plurality of decoded instructions from the
storage slots of a selected cache memory
storage location for parallel execution by the
plurality of functional units.

3. A microprocessor as in claim 2 wherein each
of the cache memory storage locations includes
means for storing auxiliary information
indicative of whether the plurality of instructions
stored in the slots of a cache memory storage
location are independent such that the
instructions may be executed in parallel, or
dependent such that the instructions must be
executed sequentially.

4. A method of executing instructions by a processor
that retrieves instructions from a main
memory external to the processor or from an
internal instruction cache memory, the method
comprising:

(a) retrieving an encoded instruction from
the main memory;

(b) decoding the encoded instruction retrieved
from main memory;

(c) storing the decoded instruction in an
internal cache memory; and

(d) retrieving the decoded instruction from
the internal cache memory for execution by
the processor.

5. A method of executing instructions by a microprocessor
that retrieves instructions from a
main memory external to the microprocessor
or from an internal instruction cache memory,
the microprocessor including a plurality of
functional units for executing instructions, the
method comprising:

(a) retrieving encoded instructions from
main memory;

(b) decoding the instructions retrieved from
main memory;

(c) storing the decoded instructions in an
internal cache memory storage means comprising
a plurality of storage locations, each
storage location comprising a plurality of
storage slots, each of the storage slots
comprising means for storing a decoded
instruction; and

(d) simultaneously retrieving a plurality of
decoded instructions from the storage slots
of a selected cache memory storage location
for parallel execution by the plurality of
functional units.

6. A method as in claim 5 and including the step
of storing auxiliary information in the cache
memory storage locations, the auxiliary information
being indicative of whether the plurality
of instructions stored in the slots of a cache
memory storage location are independent such
that the instructions may be executed in parallel,
or dependent such that the instructions
must be executed sequentially.

Partially decoded instruction

A microprocessor partially decodes instructions
retrieved from main memory before placing them
into the microprocessor's integrated instruction
cache. Each storage location in the instruction cache
includes two slots for decoded instructions. One slot
controls one of the microprocessor's integer pipelines
and a port to the microprocessor's data cache.
A second slot controls the second integer pipeline or
one of the microprocessor's floating point units. The
instructions retrieved from main memory are decoded
by a loader unit which decodes the instructions
from the compact form as stored in main
memory and places them into the two slots of the
instruction cache entry according to their functions.
In addition, auxiliary information is placed in the
cache entry along with the instruction to control
parallel execution as well as emulation of complex
instructions. A bit in each instruction cache entry
indicates whether the instructions in the two slots are
independent, so that they can be executed in parallel,
or dependent, so that they must be executed
sequentially. Using a single bit for this purpose allows
two dependent instructions to be stored in the
slots of the single cache entry.